Text-Image Topic Discovery for Web News Data

Author

  • Mingjie Qian
Abstract

We formally propose a new application problem: unsupervised text-image topic discovery. The problem is important because almost all news articles have an associated picture. Unlike traditional topic modeling, which considers text alone, the new task aims to discover heterogeneous topics from web news of multiple data types. Heterogeneous topic discovery is challenging because different media data types have different characteristics and structures, and a systematic solution that integrates information propagation and mutual enhancement between data of different types in a principled way is not easy to obtain, especially when no supervision information is available. We propose to tackle the problem with a regularized, nonnegativity-constrained l2,1-norm minimization framework. We also present a new iterative algorithm to solve the optimization problem. To objectively evaluate the proposed method, we collect two real-world text-image web news datasets. Experimental results show the effectiveness of the new approach.
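As a rough illustration of the l2,1-norm term named in the abstract, the sketch below computes the l2,1 norm of a reconstruction residual under one common convention (the sum of the Euclidean norms of the rows). The factorization form X ≈ G F^T is taken from the appendix excerpt listed on this page; the function names, the toy matrices, and the row-wise convention are assumptions made for illustration, and the regularization term and the paper's iterative algorithm are not reproduced here.

```python
import numpy as np


def l21_norm(A: np.ndarray) -> float:
    """Sum of the Euclidean norms of the rows of A (assumed row-wise convention)."""
    return float(np.linalg.norm(A, axis=1).sum())


def reconstruction_loss(X: np.ndarray, G: np.ndarray, F: np.ndarray) -> float:
    """l2,1 norm of the residual X - G F^T, with G and F required to be nonnegative."""
    if (G < 0).any() or (F < 0).any():
        raise ValueError("G and F must be nonnegative")
    return l21_norm(X - G @ F.T)


# Hypothetical toy data: 2 samples, 3 features, 2 topics.
G = np.array([[1.0, 0.0], [0.0, 2.0]])
F = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X = G @ F.T

# Corrupt only the first sample; the l2,1 norm charges each row by its
# unsquared Euclidean norm, so one outlying sample adds 5.0 rather than 25.0.
X_noisy = X.copy()
X_noisy[0] += np.array([3.0, 4.0, 0.0])

clean_loss = reconstruction_loss(X, G, F)
noisy_loss = reconstruction_loss(X_noisy, G, F)
```

Because each row contributes its unsquared norm, a single badly reconstructed sample influences this loss less than it would influence a squared Frobenius loss, which is the usual motivation for choosing an l2,1 objective on noisy data.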

Similar resources

Appendix for Text-Image Topic Discovery for Web News Data

In this appendix, we will give detailed derivation, proof of convergence, complexity analysis, parameter analysis, and convergence curves for the proposed method. I. FORMULATION. We formalize joint text-image topic discovery by the following optimization problem: $\min_{F \ge 0,\, G \ge 0} \|X - G F^T\|_{2,1} + \frac{\lambda}{2} \sum^{M}$ ...

A New Document Embedding Method for News Classification

Text classification is one of the main tasks of natural language processing (NLP). In this task, documents are classified into predefined categories. A large amount of news spreads on the web. A text classifier can categorize news automatically, which facilitates and accelerates access to it. The first step in text classification is to represent documents in a suitable way t...

Information Discovery based on Multi-granularity Text Fusion

In this paper we introduce a new information discovery algorithm for the Web, Multi-granularity Text Fusion (MGTF). Granularity refers to the length of news-relevant web documents, such as news web pages, blogs, and microblogs, which come from web users. The longer the text, the higher its granularity. Given a topic query on the Internet and the results of different granularity and time-s...

Discovering and Tracking Events From News, Blogs and Microblogs on the Web

Using three data sources (news, blogs, and microblogs), this study proposes a framework for discovering and tracking events embedded in free-form online text. Existing text mining methods are discussed for the three sources. Because the three sources have different perspectives, event analysis, a region-topic model, and rare keywords are proposed for them, respectively. In order to integrate three data source...

Automatic Image Annotation Using Semantic Text Analysis

This paper proposes a method to find annotations corresponding to given CNN news documents in order to detect terrorism images or context information. Assigning keywords or annotations to images is one of the important tasks in letting machines understand web data written by humans. Many techniques have been suggested for automatic image annotation in the last few years. Much research has focused on the method...

Publication date: 2014